Using LWL85 with Esee via Esee2LWL
==================================
The program LWL85, described by Li, Wu, and Luo (1985, MBE 2:150-174),
gives two-parameter (and sometimes one-parameter) estimates of genetic
distance for protein coding genes. LWL85 is not included with ESEE;
however, if you do have it, the utility program Esee2LWL should simplify
the use of Esee save files with LWL85.
As far as ease of use goes, LWL85 is on a par with Felsenstein's Phylip
programs, except that there are fewer options to worry about, and
this program is a bit more finicky about which columns the data
are in, as befits a Fortran program.
***********************************************************
USING ESEE to make LWL files.
(automatic method)
- Start ESEE and align your sequences. I recommend deleting
any codons that are not found across all of the sequences
being considered.
- Save the file in an ESEE save file
- Leave ESEE
- Run the program Esee2lwl.exe that is on this disk
(manual method)
- Start ESEE and align your sequences.
- carefully put in the LENGTH-space-NAME-space-COMMENT fields
at the very beginning of each sequence.
- insert blanks between those fields and the actual start of the
sequence...the sequence should start at POSITION 82. This is crucial.
- with the cursor at position 82, press F1 to get triplet spacing
- repeat these steps with each sequence that you wish to output
- Go to the print-out window and change the line length to 80.
- output all of the sequences, one at a time, to an ASCII file
using ESEE's OUTPUT command. When the prompt for
overwrite, append or abort appears, select append.
- save your work to a save file if you wish, and exit ESEE
**********************************************************************
WHAT DOES Esee2LWL do?
In roughly the following order it:
- prompts you for a file name, then reads the data,
skipping sequences that are type P, T or A and taking only
sequences of type N
- aborts if the number of valid sequences is less than 2
- trims the sequence names to 59 characters
- trims the sequences to 2100 characters (if necessary)
- checks for sequence length conflicts.
The sequences should all be of uniform length for LWL.
If the (now trimmed) sequences are not of the same length,
the program reports the lengths of the first, shortest and
longest sequences. You are then prompted for the sequence
length to use. You may specify any integer from 3 to the
length of the longest sequence. Sequences shorter than the
length that you specify are padded with either N's or ?'s
(depending on whether you are working with DNA or protein).
- checks for name conflicts; you are prompted until all of
the names are unique
- prompts you for a name for the output file; you have the
option to escape if the file already exists
- sends the data to the output file in the format required by LWL
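The checks listed above can be sketched in a few lines of Python. This is
only an illustration of the described steps, not the Esee2LWL source: the
(name, type, sequence) tuple layout and the function name are hypothetical,
errors are raised where the real program prompts, and cutting longer
sequences down to the chosen length is an assumption.

```python
MAX_NAME = 59    # name limit stated in the list above
MAX_LEN = 2100   # sequence length limit stated in the list above


def prepare(records, target_len=None, pad_char="N"):
    """records: list of (name, seq_type, sequence) tuples (hypothetical layout)."""
    # Keep only type N sequences; types P, T and A are skipped.
    kept = [(name[:MAX_NAME], seq[:MAX_LEN])
            for name, seq_type, seq in records if seq_type == "N"]
    if len(kept) < 2:
        raise ValueError("need at least 2 type N sequences")

    lengths = [len(seq) for _, seq in kept]
    if len(set(lengths)) > 1:
        if target_len is None:
            # The real program reports these figures and then prompts.
            raise ValueError(
                f"length conflict: first={lengths[0]}, "
                f"shortest={min(lengths)}, longest={max(lengths)}")
        if not 3 <= target_len <= max(lengths):
            raise ValueError("length must be from 3 to the longest sequence")
        # Assumption: longer sequences are cut, shorter ones are padded.
        kept = [(name, seq[:target_len].ljust(target_len, pad_char))
                for name, seq in kept]

    names = [name for name, _ in kept]
    if len(set(names)) != len(names):
        raise ValueError("sequence names must be unique")
    return kept
```

For example, prepare([("a", "N", "ATGAAA"), ("b", "P", "MK"), ("c", "N", "ATG")],
target_len=6) keeps sequences a and c and pads c to ATGNNN.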
You can use ? for ambiguous bases and *** for ambiguous codons.
If you use ? then make sure to specify 001 as the last part
of the LWL prompt. For instance, to get pairwise distances
between 3 taxa that have some ambiguities you would answer
LWL's prompt with
003003001001
The four 3-character fields are, from left to right:
003 = number of species (sequences) in the file
003 = number of sequences in the pairwise comparisons
001 = 001 or 000 (two different strategies for confusing
      amino acid substitutions)
001 = throw out codons with ambiguities, for all
      comparisons in the run
**********************************************************************
I will now attempt to explain the input format of LWL85.
Each sequence begins with its length in nucleotides, right justified
in a field consisting of the first twenty columns of the first line
for that sequence. In plain terms, the number expressing the
length must end in column 20. Say the number is 109, but the nine
is in column 19. The program will interpret this number as 1090!
After the length, skip a space and put in the name. Then skip
another space and put in an optional label, if you wish.
Then comes the sequence itself.
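As a rough illustration of the header line, here is a small Python sketch.
The function names are made up for this example. The second function only
mimics the behaviour described above (a blank inside the numeric field
counts as a zero); it is not taken from the LWL85 source.

```python
def header_line(length, name, label=""):
    """Build the first line for one sequence (hypothetical helper)."""
    # Right justify the length in a 20-column field so that its
    # last digit lands in column 20, then a space and the name.
    line = f"{length:>20d} {name}"
    if label:
        line += f" {label}"   # optional label after one more space
    return line


def read_blank_as_zero(field):
    """Read a numeric field the way the text describes: blanks act as zeros."""
    return int(field.replace(" ", "0"))


good = header_line(109, "HUMAN", "insulin")
print(read_blank_as_zero(good[:20]))      # length field read correctly

shifted = " " * 16 + "109 "               # the nine sits in column 19
print(read_blank_as_zero(shifted))        # trailing blank becomes a zero
```

The first print gives 109 and the second gives 1090, which is the mistake
the paragraph above warns about.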
Here are the rules:
- column 1 is empty
- the sequence is presented in triplets, 20 triplets per line
- if any of the sequences is missing a residue relative to
any of the others, convert that ENTIRE CODON to ***
- lines are 80 columns wide
- don't include the initiation and termination codons
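The rules above can be sketched in Python as follows. Two things here are
assumptions rather than facts from LWL85: the gap character "-" and the
choice to put one blank in front of every triplet. The latter at least
agrees with the rules, since it leaves column 1 empty and makes a full
line of 20 triplets exactly 20 x 4 = 80 columns wide.

```python
def mask_codons(seq, gap_chars="-"):
    """Replace every codon that contains a gap with *** (gap char is assumed)."""
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    return "".join("***" if any(ch in gap_chars for ch in codon) else codon
                   for codon in codons)


def triplet_lines(seq, per_line=20):
    """Split a sequence into lines of blank-separated triplets."""
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    lines = []
    for i in range(0, len(codons), per_line):
        # One blank before each triplet keeps column 1 empty.
        lines.append("".join(" " + codon for codon in codons[i:i + per_line]))
    return lines


for line in triplet_lines(mask_codons("ATG-AACCC")):
    print(line)
```

The example prints one line holding the triplets ATG, *** and CCC, each
preceded by a blank.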
============================================================================
When you run LWL85 there is a series of prompts.
The first prompt asks you to type either ZZ3 or ZZZ3.
This refers to the codon designations and mutational pathway weights.
For rapidly evolving genes it is recommended that you use ZZZ3.
Use ZZ3 for the insulin example.
Then you are asked for the name of the data file.
Next you have to enter the name of an output file. I believe PRN: works
for the printer and CON: works for the screen.
The next (and final) prompt causes the most problems for users.
It is asking for four parameters, each expressed right justified in
3-character wide fields.
The first parameter is the number of sequences in the file.
The second is the number of sequences to include in the pairwise
comparisons. I see no reason not to include all of them, so this should be
the same number as the first parameter.
The third parameter is ICHECK. It deals with how certain conflicts in
mutational pathways are handled. At present, I'm not sure how it affects
the result.
The fourth parameter is ITOSS. ITOSS=1 means that if a gap
(deletion or undetermined residue) exists in any of the given sequences,
it is assumed that a gap exists for all sequences at the same site.
In the insulin example ITOSS=1 gives the same result as ITOSS=0
since the asterisk method is used to handle gaps.
Try LWL with the file insulin.
I suggest using this string for the parameter prompt:
004004000000 {with icheck off}
004004001000 {with icheck on}
Notice how you have to be somewhat defensive about the format of these
numbers because of the way that Fortran deals with input.
Thus 004 means 4, while space-4-space means 40 and 4-space-space means 400.
Likewise 1 is expressed as 001. Try to avoid using any spaces at
all with this response to the prompt.
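One way to avoid stray spaces is to build the response with zero padding.
The Python sketch below does that; the function name is made up for this
example, and the 0 to 999 range check simply follows from the 3-character
field width.

```python
def lwl_params(n_in_file, n_compare, icheck=0, itoss=0):
    """Build the 12-character answer to the final LWL85 prompt."""
    values = (n_in_file, n_compare, icheck, itoss)
    if any(not 0 <= v <= 999 for v in values):
        raise ValueError("each parameter must fit in a 3-character field")
    # Zero padding fills each field completely, so no blanks remain.
    return "".join(f"{v:03d}" for v in values)


print(lwl_params(4, 4))               # icheck off, as suggested for insulin
print(lwl_params(4, 4, icheck=1))     # icheck on
```

These two calls reproduce the strings suggested above for the insulin
file, 004004000000 and 004004001000.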
Eric Cabot, August 1989